Thouless- Anderson-Palmer Approach for Lossy Compression 

Tatsuto Murayama 

RIKEN Brain Science Institute, Hirosawa 2-1, Wako, Saitama 351-0198, Japan 

(Dated: February 2, 2008) 

Abstract 

We study an ill-posed linear inverse problem, where a binary sequence will be reproduced using 
a sparce matrix. According to the previous study, this model can theoretically provide an optimal 
compression scheme for an arbitrary distortion level, though the encoding procedure remains an 
NP-complete problem. In this paper, we focus on the consistency condition for a dynamics model of 
Markov-type to derive an iterative algorithm, following the steps of Thouless- Anderson-Palmer's. 
Numerical results show that the algorithm can empirically saturate the theoretical limit for the 
sparse construction of our codes, which also is very close to the rate-distortion function. 
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Lossy compression is quite important in our modern life. One first encodes information 
into an appropriate form, which will be decoded to reproduce similar sequence. The the- 
oretical framework for this kind of compression scheme with a fidelity criterion is called 
the rate-distortion theory, which consists an important part of the Shannon's information 
theory 

We start by defining the concepts of the rate-distortion theory and stating the simplest 
version of the main result |2j. Let J be a discrete random variable with alphabet J . As- 
sume that we have a source that produces a sequence Ji, J 2 , • • • , Jm, where each symbol is 
randomly drawn from a distribution. We will assume that the alphabet is finit. Throughout 
this paper, we use vector notation to represent sequences for convenience of explanation: 
J = (Ji, J 2 , • • • , Jm) T £ J M ■ Here, the encoder describes the source sequence J G J M by 
a codeword £ = /(«/) G X N . The decoder represent J by an estimate J = g(£) G J M . 
Note that M represents the length of a source sequence, while N represents the length of a 
codeword. In this case, the rate is defined by R = N/M. Note that the relation N < M 
always holds when a compression is considered; therefore, R < 1 also holds. 

A distortion function is a mapping d : J x J — > R + from the set of source alphabet- 
reproduction alphabet pairs into the set of non-negative real numbers. Intuitively, the 
distortion d(J, J) is a measure of the cost of representing the symbol J by the symbol J. 
This definition is quite general. In most cases, however, the reproduction alphabet J is 
the same as the source alphabet J . Hereafter, we set J = J and the following distortion 
measure is adopted as the fidelity criterion; the Hamming distortion is given by 

{0 if J = J 
(1) 
1 if J + J 

which results in a probable error distortion, since the relation Ej[d(J, J)] = V[J ^ J] 
holds, where Ej[-\ represents the expectation and V[-} the probability of its argument. 
The distortion measure is so far defined on a symbol-by-symbol basis. We extend the 
definition to sequences. The distortion between sequences J, J G J M is defined by 
d(J,J) = (1/M) Ylu=i ^{J^ J^). Therefore, the distortion for a sequence is the average 
distortion per symbol of the elements of the sequence. The distortion associated with the 
code is defined as D = Ej[d(J, J)], where the expectation is with respect to the proba- 
bility distribution on J . A rate-distortion pair (R, D) should be achievable if a sequence 
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of rate-distortion codes (/, g) exist with Ej[d(J, J)] < D in the limit iV — > oo. Moreover, 
the closure of the set of achievable rate-distortion pairs is called the rate-distortion region 
for a source. Finally, we can define a function to describe the boundary; the rate-distortion 
function R(D) is the infimum of rates R, so that (R, D) is in the rate-distortion region of 
the source for a given distortion D. 

In this paper, we restrict ourselves to a binary source J with a Hamming distortion 
measure for simplicity. We assume that binary alphabets are drawn randomly, i.e., the 
source is not biased to rule out the possiblity of compression due to redundancy. We now 
find the description rate R(D) required to describe the source with an expected proportion 
of errors less than or equal to D. In this simplified case, according to Shannon, the boundary 
can be written as follows; the rate-distortion function for a binary source with Hamming 
distortion is given by 

\l-h 2 (D) 0<D<± 
R(D)={ ~ ~\ (2) 

[0 \<D 

where h,2(-) represents the binary entropy function. 

Next we introduce a toy model for the lossy compression. We use the inverse problem 
of Sourlas-type decoding to realize the optimal encodingscheme, a variation of which has 
recently been investigated by some information theorists 3j. As in the previous paragraphs, 
we assume that binary alphabets are drawn randomly from a non-biased source and that the 
Hamming distortion measure is selected for the fidelity criterion. Theoretically speaking, it 
has been reported that the typical distortion could be well captured by the Parisi one-step 
RSB scheme, giving the physical interpretaion of R(D) Q. In this paper, we will discuss an 
actual encoding technique for this optimal family of codes. 

Firstly, we take the Boolean representation of the binary alphabet J7", i.e., we set J = 
{0, 1}. We also set X = {0, 1} to represent the codewords. Let J be an M-bit source 
sequence, £ an iV-bit codeword, and J an M-bit reproduction sequence. Here, the encoding 
problem can be written as follows. Given a distortion D and a randomly-constructed Boolean 
matrix A of dimensionality M x N, we find the iV-bit codeword sequence which satisfies 

J = A£ (mod 2) , (3) 

where the fidelity criterion D = E[d(J, J)} holds, according to every M-bit source sequence 
J. Note that we applied modulo 2 arithmetics for the additive operations in ©. In our 



framework, decoding will just be a linear mapping j = A£, while encoding remains an 
NP-complete problem. 

Let the Boolean matrix A be characterized by K ones per row and C per column j^. 
The finite, and usually small, numbers K and C define a particular code. The rate of our 
codes can be set to an arbitrary value by selecting the combination of K and C. We also 
use K and C as control parameters to define the rate R = Kj C . If the value of K is small, 
i.e., the relation K N holds, the Boolean matrix A results in a very sparse matrix. By 
contrast, when we consider densely constructed cases, K must be extensively big and have 
a value of O(N). We can also assume that K is not 0(1) but K -C N holds. 

The similarity between codes of this type and Ising spin systems was first pointed out by 
Sourlas, who formulated the mapping of a code onto an Ising spin system Hamiltonian in the 
context of error-correcting codes p . To facilitate the current investigation, we first map the 
problem to that of an Ising model with finite connectivity following Sourlas' method. We use 
the Ising representation {1, —1} of the alphabet J and X rather than the Boolean one {0, 1}; 
the elements of the source J and the codeword sequences £ are rewritten in Ising values, 
and the reproduction sequence J is generated by taking products of the relevant binary 
codeword sequence elements in the Ising representation J M = n«e£(M) Here, we denote 
the set of codeword indexes i that participate in the source index \i by £(//) = {i|a Mi = 1} 
with A = (a^). Therefore, chosen i's correspond to the ones per row, producing an Ising 
version of J. Note that the additive operation in the Boolean representation is translated 
into the multiplication in the Ising one. Hereafter, we set J M , J^,^ = ±1 while we do not 
change the notations for simplicity. Furthermore, as we use statistical-mechanics techniques, 
we consider the source and codeword- sequence dimensionality (M and N, respectively) to 
be infinite, keeping the rate R = N/M finite. 

To explore the system's capabilities, we examine the Hamiltonian: 

M 

H(S\J) = Y,G[S\J,} , (4) 

11=1 

with 

G[S\J,] = -J, H Si (5) 

ie£(M) 

where we have introduced the dynamical variable Si to find the optimal value of and 
G[S\ Jy\ denotes the local connectivity of a random hypergraph neighboring the source bit 
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Jfj,. It is convenient to represent the posterior probability of codeword S given a source J 
in the form 



V{S\J) 



exp[-(3H(S\J) 

zU) 



(6) 



with the inverse tempresure j3, where Z(J) = Trgexpf— (3H (S\J)] is the partition function. 

We obtain an expression for the free energy per source bit expressed in terms of the 
probability distributions tt(x) and t}(x): 

-0/ = ^<O*Z(J)» 
= In cosh (3 



K 



] J n(.ri)(l.ri 



l=i 



K 



In J 1 + tanh j3 J Y\ tanh f3xi 



— K I ir(x)dx / fc(x)dx ln(l + tanh /3a; tanh/5x) 



C 
+ K 



c 



Y\_-n~(xi)dxi 



i=i 



In 



c 



S 1=1 



(7) 



where ((•••)) denotes the average over quenched randomness. The saddle point equations 
with respect to probability distributions provide a set of relations between ir(x) and 7r(x): 



71 (X) 



ir(x) 



c 



] J n{.ri)(i:r, 
i=i 

K 

Y\_n(xi)dx l 



c-i 
5 ix-^xi 



i=i 



i=i 
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K-l 



— — tanh 1 J tanh /3 J J^J tanh (3xi 



i=i 



(8) 
(9) 



By using the result obtained for the free energy, we can easily perform further straightforward 
calculations to find all the other observable thermodynamical quantities, including internal 
energy: 

1 d 



^{(Tr s H(S\J)e-^)) = -^(^Z(J))) 



(10) 



which records reproduction errors. This set of equations (jSJ) and © may be solved numeri- 
cally for general (3, K, and C. The spin glass solution can be calculated for both the replica 
symmetric and the one step RSB ansatz. The former reduces to the paramagnetic solution 
(/rs — — 1)? which is unphysical for R < 1, while the latter yields continueous distribu- 
tions tt(x) and tt(x) at the freezing point j3 g , which can be obtained from the root of the 



equation enforcing the non-negative replica symmetric entropy. The Random- Energy- Model 
limit K, C —>■ oo and simple algebra gives the relation between the rate R = N/M and the 
distortion D in the form R — 1 — hoiD), which coincides with the rate-distortion function 

n 

in the Shannon's theorem |^]. 

We now take the Thouless-Anderson-Palmer approach to build a dynamics model using 
the Markov process assumption on prior beliefs, showing that it is possible to obtain a closed 
set of equations for practical encoding. At this point, we assume a mean field behavior for 
the dependence of the dynamical variables S on a certain realization of the source sequence 
J, i.e., the dependence is factorizable and might be replaced by a product of mean fields. 
Furthermore, we treat the Boltzmann weights for specific codeword bit S{ are factorizable 
with respect to the source bit J M []]]. On the other hand, from a physicist point of view, it 
is natural to introduce the Markov process assumption on the priors to find a solution in 
the spin glass state. This term can be considered as the prior knowledge at a certain time 
t + 1, given the previous one on the variables at t. Hereafter, we introduce a parameter 
t — 1, 2, • • • to represent time evolution. In this scenario, we can derive a set of consistency 



equations: 



m{J,\s^ {W) = Y, e-" Gt{sw U v t(sA{J^}) , (11) 

Vt+iiSiHJ^}) =a^Q t+ i(Si) Yl Wt(J^>\Si,{J^^}) , ( 12 ) 

/i'£M(i)\n 



with 



Qt+i(S) = (exp{aS + tanh 1 7 • SSi))^^^}) (13) 

where is a normalization factor. In (|13|). we have two parameters to determine; a denotes 
the ferromagnetic bias and 7 introduces the autocorrelation for sequences. 

Here, we introduce another set AA{i) such that it defines the set of source indexes linked 
to the codeword index i. Equation (|TT|) evaluates the average influence of the newly added 
parity bit J M to Si, when {Si'\i' G C(fi)\i} obeys a posterior distribution, which should be 
determined by the rest of data set {J^\fi' G A4(i)\fi}. This calculation corresponds to the 
cavity method in the conventional framework, representing the effective Boltsmann weight 
Wt(Jfi\Si(t), { J m y m }) produced by J M , in which the self-induced contributions are eliminated 
by assuming the tree description for loopy interactions [8j. On the other hand, equation ()12|) 
indicates the stack of the cavity fields determines the posterior distribution V t (Si\{ J^/^}). 




FIG. 1: Snapshots of probability distributions for K = 2, L = 3 and (5 g = 2.35. (LEFT) 
Stable solution of (JHJ) and @ is calculated by Monte Carlo methods. We use 10 5 bin models to 
approximate the probability distributions ir(x) and vr(x), starting from various initial conditions. 
(RIGHT) Fixed-point condition for the density evolution of (|19l) and Q2UJI is represented in terms 
of the probability distributions in the same bin model. We use the relation m^i = tanh (3y and 
rh^i = tanh/3y, where the variables are assumed to be generated from common densities p and p, 
respectively. 

In this formula, the approximated marginal posterior will be 

V t +i(Si\J) = aiQ t +i(Si) J] WtiMSi^J^}) , (14) 

neM(i) 

taking the full set of the cavity fields, determined self-consistently by (jllj) and (fT2j) . into 
account, where aj is a normalization factor again. 

Next, we present more convenient form of the above equations. The conditional proba- 
bility V t (J^\Si, {J^^l^}) is a normalized effective Boltzmann weight: 

Vt{Jn\Si, Un'^n}) = b^WtiJ^Si, {J M '^}) 

where b^ is a normalization constant. This relation is obtained by taking the connection fi 
out of the system, and taking into consideration the dependance of the variables S on all 
other connections. 
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The identity 



e -/?Gt(0[J„) 



i cosh(/3J M ) • I 1 + tanh(/?J M ) J] ^ (16) 



and simple algebra with respect to the newly-defined variables m^it), m^it) G [— 1,+1] 
satisfying the relations: 

WKW) = 1 + m g® Si , (17) 

y«(w,{w)= i+ y )g< as) 

give the set of consistency equations in the form 



(t + 1) = tanh(/3 J„) JJ m^(t) , (19) 
(t+l)=tanh ^ tanh -1 rri^i(t) + a + tanh" 1 777^) , (20) 



with the pseudo-posterior expresion 



m 



(t)=tanh( ^2 tann 1 m^(t) + a + t&nh 1 7TWi(t)| . (21) 

fj,eM(i) 



The set of equations (|T9*|) and (j20j) give an iterative algorithm for code generation. In our 
dynamics model, the choice of parameter 7 = results in naive TAP equation without 
the reaction term Therefore, in this case, the dynamics can be strictly captured using 
the method of "density evolution" proposed by Richardson and Urbanke in the context of 
determining the capacity of low-density parity-check (LDPC) codes under message-passing 
decoding jlOj. Let p(-) denote the common density of (1//3) tanh -1 m M j, and p(-) the density 
of (1/(3) tanh -1 rh^ respectively, it is easy to see the set of probability distributions should 
satisfy the saddle point equations (jH)) and Q. It is quite interesting to find the consistency 
between the information theoretic "density evolution" technique and the replica theory for 
disordered statistical systems. Therefore the similarity between 7r and p (or ft and p) can 
be considered as an important measure for good encoding, giving the design principle for 
dynamics model [FIG.^. Finally, the equation (}2"Tj) provides the Bayes optimal encoding 
rhi(t) = sign(mi(t)) in the Ising representation. 
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FIG. 2: Empirical performance: Numerical experiments show that the algorithm with optimal 
a = and (5 = (5 g can achieve the bound for sparse construction of the codes, where K = 2 and 
L = 3, 4, ■ • ■ ,12. We choose 7 = 0.01 for C = 3 and 7 = 0.1 for the rest. Solid line denotes the rate- 
distortion function R{D) for binary sequences by Shannon, while dashed line can be easily achieved 
by universal lossless coding techniques, (o) Numerical results with the system size N = 20000, 
averaged over 10 trials for each evaluation. (□) Theoretical bound for the sparse construction 
obtained by the 10 5 bin model for Monte Carlo sampling in the replica framework. 

Practical encoding scheme for this compression model will be as follows. Given the source 
sequence J, we first translate the Boolean alphabets into that of Ising ones. Then, for a cer- 
tain set of control parameters a, [3 and 7, the equations (flUj) . PUJ) are recursively calculated 
until they converge to a certain fix point. Lastly, according to the equation (J2Tj) . we calculate 
the codeword sequence £ from the Boolean translation of rh = sign(m). Notice that the 
decoding process will be just a linear mapping. The most interesting quantity to examine is 
clearly the minimum typical distortion for a given compression rate. Empirical results are 
shown in FIG. El together with the theoretical evaluation for the code constructions using 
the replica method. We use the optimal inverse temerature (3 g for code generation, where 
per-bit entropy vanishes at the freezing point. Recent works in the information science reveal 
that designing the codes which approach to the Shannon's limit R{D) is quite difficult in the 
practical sense; we do not have good coding methods of low complexity, especially of O(N). 
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Our code construction, however, takes only O(N), and the performance is surprisingly good. 
We believe that the physicist approach can play an important role in the lossy compression 
schemes, as we have already seen in the context of the error- correcting codes. 

Future directions of the current research include utilizing more refined approximation 
techniques to find better coding schemes for lossy compression, as well as the evaluation 
of the trade-off relation between performance and computational costs. These tasks are 
interesting and challenging. 
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